Abstract Learning Classifier Systems (LCS) are population-based
reinforcement learners used in a wide variety of applications.
This paper presents an LCS in which each traditional rule is
represented by a spiking neural network, a type of network with
dynamic internal state. We employ a constructivist model of
growth of both neurons and dendrites that realises flexible
learning by evolving structures of sufficient complexity to
solve a well-known problem involving continuous, real-valued
inputs. Additionally, we extend the system to enable temporal
state decomposition. By chaining together sequences of
heterogeneous actions into macro-actions, our LCS is shown to
perform optimally in a problem where traditional methods can
fail to find a solution in a reasonable amount of time. Our
final system is tested on a simulated robotics platform.

Keywords Learning Classifier Systems • Spiking Neural 
Networks • Self-adaptation • Constructivism 



1 Introduction 

Many real-world processes encompass some temporal ele- 
ment that describes or affects the state of a given system 
through time. Artificial Intelligence ubiquitously utilises
abstractions of real-world or natural systems; however, many
representations remove these temporal facets during
the abstraction process. Missing out on underlying temporal
patterns in a problem potentially denies a system access to a 
rich vein of information which can be harnessed to generate
more efficient solutions. 

Within machine learning, temporal problems are partic- 
ularly prevalent in the field of robot control, whereby robotic 
agents act through a number of time-steps to achieve some 
user-defined goal. To achieve any kind of reasonably com- 
plex behaviour, temporal aspects of the environment must be 
taken into account. This is perhaps most easily realised by 
using some temporally-dependent solution representation to 
match the nature of the problem representation. In this work, 
we concentrate on Spiking Neural Networks (SNNs), which
are temporally-sensitive abstractions of biological neural
networks. They are of particular interest to us as (i) they have
been successfully applied to myriad robotic control problems
and (ii) there is growing evidence that the more
biologically-realistic spiking models are efficient forms of
knowledge representation (e.g. Maass (1997), Saggie-Wexler et al (2006)).

Our specific approach to temporal machine learning involves
the use of a Learning Classifier System, or LCS (Holland,
1976), a form of online evolutionary reinforcement-based
learning that evolves a population of (condition, action,
prediction) rules using a Genetic Algorithm (GA) (Holland,
1975).



Traditionally, LCSs have used a ternary rule condition
structure, comprising binary digits {0,1} and a generalisation
character {#}, to match in certain subspaces of a
binary-represented state space. However, such a representation
limits the application of LCSs to problems which can be
binary-represented. Rule representations have since been extended
to handle integer and real-valued states using several
representations including intervals (Wilson (2000) and Wilson
(2001b) respectively), hyperellipsoids (Butz et al, 2006) and
convex hulls (Lanzi and Wilson, 2006), opening LCSs to more
varied problem domains.
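
The ternary matching rule described above can be illustrated with a minimal sketch. The function name `matches` and the use of strings for conditions and states are assumptions made for illustration; they are not taken from any particular LCS implementation.

```python
def matches(condition: str, state: str) -> bool:
    """Return True if a ternary condition over {0, 1, #} matches a binary state."""
    if len(condition) != len(state):
        raise ValueError("condition and state must have equal length")
    # '#' is the generalisation character: it matches either bit at its position.
    return all(c == "#" or c == s for c, s in zip(condition, state))


print(matches("1#0", "110"))  # '#' absorbs the middle bit
print(matches("1#0", "111"))  # last position differs, so no match
```

A condition with more `#` characters covers a larger subspace of the binary state space, which is the sense in which such a rule generalises.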

Further extensions enlarged the remit of LCS, including
LISP S-expression rule conditions (Lanzi and Perrucci,
1999), computed actions (Ahluwalia and Bull, 1999), and
fuzzy logic with computed actions (Valenzuela-Rendon, 1991).
Artificial neural networks have also been used to the same
effect (Bull, 2002). We follow previous work with LCS using
neural classifiers (e.g. Howard et al (2008)), which has
shown that it is possible to allow evolution to control both
the number of hidden layer nodes (termed Neural Constructivism)
and the number of node connections (termed Connection
Selection) during the reinforcement learning process,
alongside each classifier's rates of mutation. In particular,
the work has utilized multi-layered perceptrons (MLP)
(Rumelhart and McClelland, 1986) to generate optimal action
selection policies by calculating actions based on the
input state.

To date, only one scheme that can utilise temporal
information has been used within LCS: a form of recurrent
Boolean logic network (Bull, 2009). Here, we combine a
modern LCS with a temporally dynamic SNN representation
by replacing each classifier condition/action pair with an
SNN. It is shown that the use of an SNN allows the classifiers
to harness the network's persistent, dynamic internal state
to produce temporally-dependent activation patterns.
Additionally, we modify our LCS as in (Studley and Bull, 2005)
to create a Temporal Classifier System (TCS), which facilitates
generalisation in both time and space, thereby enabling
more direct use of the temporal behaviour of the neurons for
determining actions.

Our hypothesis is that (i) the evolved SNNs will
display significant benefits compared to previous work using
MLP networks (Howard et al, 2009) and (ii) the use of
temporally-sensitive SNN classifiers will extend the remit of
XCSF to solving a class of problems that include an element of
time, e.g. semi-Markov Decision Processes (semi-MDPs) (Sutton
et al, 1999).

Specifically, we address the following research questions: 
does the use of SNN networks compared to MLP networks 
provide any significant advantage in terms of performance, 
solution size or network parsimony? Does the evolutionary 
design process evolve the networks in different ways? How 
well does the system compare to a benchmark reinforcement 
learner? Is there synergy between TCS and SNN classifiers, 
in that TCS provides a framework that allows the SNNs to 
harness their temporal element more effectively, which in 
turn increases the performance of TCS? 

To address these questions, we test our SNN representation
against both an MLP representation and a tabular
Q-learner (Watkins, 1989), which acts as a benchmark, on a
standard Reinforcement Learning (RL) (Sutton and Barto,
1998) test problem (a 2-D grid world). We then modify the
environment to include a temporal element and test our
SNN-TCS against an MLP-TCS, to demonstrate the utility of the
former network representation over the latter. One final
modification increases the difficulty of the environment by
increasing the number of steps it takes to reach the goal state;
again we compare to a Q-learner to demonstrate the ability
of TCS to generate optimal action policies in a more difficult
environment by using underlying temporal information
in the problem. It is further shown that this temporal,
spiking LCS performs optimally in a more challenging, realistic
simulated robotics task.

The main contributions of this paper are (i) the combi- 
nation of spiking networks and a temporally-sensitive, self- 
adaptive and constructive LCS framework, which are shown 
to perform optimally in continuous time and space where 
tabular Q-learning performs poorly (ii) the subsequent de- 
ployment of the aforementioned system in a simulated robotics 
task, demonstrating the power of the system in a nontrivial 
semi-MDP. 

2 Neural Representations 

The initial work exploring artificial neural networks within
LCS used traditional feedforward MLPs to represent the rules
(Bull, 2002). Recurrent MLPs were then shown to be able to
provide memory for a simple maze task (Bull and Hurst, 2003).
Radial Basis Function (RBF) networks (e.g. Buhmann (2003))
were later used for both real (Hurst and Bull, 2006) and
simulated robotics (Bull and O'Hara, 2002) tasks. Analysis of
the rule sets produced in (Bull and O'Hara, 2002) shows that
networks evolve that compute different actions depending
upon the environmental input, as long as the correct
payoff value for those (state, action) combinations is
identical. The authors hypothesize that it may be possible for
the system to function optimally with only a single neural
rule per possible payoff value, attesting to the compact rule
representation and generalization capabilities of the neural
LCS. Both MLP and RBF networks have been shown amenable
to a constructionist approach, which can be defined as a
method whereby the number of nodes within the hidden
layer is under evolutionary control, along with the connection
weights (Bull (2002); Hurst and Bull (2006)). We have
recently extended this vein of research by introducing
explicit network-wide feature selection, termed "connection
selection", which allows each inter-neural connection to be
probabilistically enabled or disabled, with the aim of
reducing the number of connection weights within the rules
(Howard and Bull, 2008).

The use of spiking networks is based on the assumption
that simpler network forms can constrain the types of
solutions evolved by the system in a manner that is detrimental
to performance or solution parsimony, especially when
considering temporal factors. Two well-known formal spiking
implementations are the Leaky Integrate and Fire (LIF) model
and the Spike Response Model (SRM). They are covered in
detail in (Gerstner and Kistler, 2002). A number of previous
studies have evolved single spiking networks. Neuroevolution
(the use of evolution to design neural architectures) was
first applied to spiking networks by Korkin et al (1998), in
this case evolving networks that produce temporally-dependent
outputs. SRM networks were later evolved for vision-based
robot navigation (Floreano and Mattiussi, 2001). The authors
conclude that the inherent dynamics of a spiking network
may be suited to certain robotics tasks. A survey of various
methods for evolving both connection weights and
architectures in neural networks is presented in (Floreano et al,
2008). To our knowledge, the work presented in this article
is the first approach to evolving spiking neurons that
includes variable network sizes, connection selection and
self-adaptive parameters within an ensemble of networks
(population of classifiers). By combining these elements within
XCSF, we are able to demonstrate generalization performance
in excess of our previous MLP-based systems.

3 XCSF 

XCSF comprises a population of classifiers, the main
components of which are the condition, action and prediction.
The condition is used to determine whether or not the
classifier matches the current state s_t, the action is the action
the classifier advocates, and the prediction is the predicted
payoff the classifier expects from carrying out its action given
the state s_t. Each classifier also keeps track of myriad other
parameters, including the last time it was involved in GA
activity, the accuracy of its prediction value, and the fitness
of the classifier. For further details, we refer the interested
reader to (Wilson, 2001a).

Within XCSF, the fitness of a classifier is related to the
accuracy of its prediction of payoffs. This leads to a system
where all areas (not just high reward areas) of the problem
landscape are covered by classifiers that predict expected
payoff with a high degree of accuracy. XCSF further
increases its generalization capabilities by computing predicted
payoffs. That is, classifier prediction is not a constant value
as in other LCS, such as XCS (Wilson, 1995). Rather,
prediction is calculated linearly as the product of the sensory
input and a prediction weight vector, which allows the same
classifier to generalize by predicting different payoff values
in different areas of the environment. At the start of each
experiment, each classifier is initialized with a prediction
weight vector w, used to compute the classifier's prediction.
This vector has one element for each input, plus an additional
element w_0 which corresponds to x_0, a constant input
that is set as a parameter of XCSF. Each prediction weight
vector element is initially 0.

XCSF involves two varieties of trial, exploration and
exploitation, which are carried out with approximately equal
frequency. At each time-step, XCSF builds a match set, [M],
from the population [P], consisting of all classifiers whose
conditions match the current input state s_t. In the binary
case, state variables are binary and each classifier condition
is a ternary string with one element per state variable, drawn
from {0, 1, #}, where # is a generalisation character that
matches any state input at that position. A classifier
is said to match if, at each position in its condition, it has
either (i) the same binary digit as the corresponding state
variable or (ii) #. In traditional XCSF, each action must be
present in each [M]. If this is not the case, covering is used
to generate classifiers that advocate the missing action(s);
covering generates a new classifier whose condition is a copy of
the input state, with generalisation characters
probabilistically inserted. Once [M] is formed, the prediction
array is created. Classifier prediction (cl.p) is calculated
linearly as a product of the environmental input (or state, s_t)
and the prediction weight vector (w) associated with each
classifier, shown in equation 1.



$$ cl.p(s_t) = cl.w_0 \, x_0 + \sum_{i>0} cl.w_i \, s_t(i) \qquad (1) $$
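
As an illustration of the computed prediction, the following minimal sketch evaluates the linear form described above: a constant input x_0 weighted by w_0, plus one weight per state variable. The function name `prediction` and the list-based layout of the weight vector are assumptions for illustration only.

```python
from typing import Sequence


def prediction(weights: Sequence[float], state: Sequence[float], x0: float = 1.0) -> float:
    """Linear classifier prediction: w_0 * x_0 + sum over i > 0 of w_i * s_t(i)."""
    if len(weights) != len(state) + 1:
        raise ValueError("need one weight per input plus the constant-term weight w_0")
    # weights[0] pairs with the constant input x0; weights[1:] pair with the state.
    return weights[0] * x0 + sum(w * s for w, s in zip(weights[1:], state))


# A freshly initialised classifier (all weights 0) predicts a payoff of 0 everywhere.
print(prediction([0.0, 0.0, 0.0], [0.3, 0.7]))
```

Because the prediction depends on the state, a single classifier can predict different payoffs at different positions in the environment.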



The prediction array is the fitness-weighted average of
the calculated predictions for each possible action. The
prediction array is then used to decide on an action to take (in
traditional XCSF this is deterministic during an exploit trial
and random during an explore trial). Once an action is
selected, all classifiers that advocate the selected action form
the action set [A]. The action is taken and, if the goal state
is reached, a reward is returned from the environment that
is used to update the parameters of the classifiers in [A]. A
discounted reward is propagated to the previous action set
[A]_{-1} if it exists. During reinforcement, rather than
updating the classifier's prediction value, the prediction weight
vector of each classifier in the action set is updated using a
version of the delta rule. Prediction error is then updated as
in (Wilson, 2001a).
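
The two computations in this paragraph can be sketched as follows. The text only states that "a version of the delta rule" is used; the normalised form below (step size η divided by the squared norm of the augmented input) is an assumption, as are the function names and the tuple layout of the match set.

```python
from collections import defaultdict


def prediction_array(match_set):
    """Fitness-weighted average prediction per action.

    match_set is an iterable of (action, prediction, fitness) tuples.
    """
    num = defaultdict(float)
    den = defaultdict(float)
    for action, p, fitness in match_set:
        num[action] += p * fitness
        den[action] += fitness
    return {a: num[a] / den[a] for a in num if den[a] > 0.0}


def delta_rule_update(weights, state, target, eta=0.2, x0=1.0):
    """One normalised delta-rule step on a prediction weight vector (assumed form)."""
    x = [x0, *state]                       # augmented input: constant term first
    p = sum(w * xi for w, xi in zip(weights, x))
    norm = sum(xi * xi for xi in x)        # |x|^2, never zero because x0 != 0
    step = eta * (target - p) / norm
    return [w + step * xi for w, xi in zip(weights, x)]


pa = prediction_array([("N", 100.0, 1.0), ("N", 300.0, 3.0), ("E", 50.0, 2.0)])
print(pa)
print(delta_rule_update([0.0, 0.0, 0.0], [1.0, 0.0], target=1000.0))
```

Here `target` stands for the payoff used in reinforcement (the immediate reward, or the discounted value passed back to the previous action set).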

XCSF makes use of macroclassifiers to enhance
computational efficiency. A macroclassifier represents a number
of virtual single classifiers, or microclassifiers, and its
numerosity indicates the number of identical microclassifiers
that the macroclassifier represents. This provides a
computational advantage over representing each microclassifier
individually. Subsumption is the mechanism by which
macroclassifiers are formed; XCSF uses two forms of
subsumption. GA subsumption checks whether a child condition is
already represented by a more general parent, and increases
the parent's numerosity by one if this is the case. Action
set subsumption checks the most general classifier in [A]
against all other members of [A], and increases its
numerosity by one if its condition is more general than theirs. In
either scenario, the subsumed classifier is deleted from [P].
The GA may then activate if the average time since the last
GA application to the classifiers in [A] exceeds a
threshold θ_GA. Traditionally in XCSF, two offspring classifiers
are generated by reproducing, crossing, and mutating the
parents. The offspring are inserted into the population, and two
classifiers are deleted if the maximum population size is
reached. As in all other models of classifier systems, parents
stay in the population, competing with their offspring.



4 Neural XCSF 

Transitioning to a neural classifier representation (N-XCSF) 
requires modifications to be made to the algorithm detailed 
in Sect. 3. It should be noted that the use of neural classi- 
fiers only alters the matching, covering and action selection 
mechanisms. It is also important to note the distinction be- 
tween "connection weight", which refers to a node-to-node 
weight within a neural network and "prediction weight", 
which is used as in XCSF to calculate classifier prediction
linearly (after Wilson (2001a)).

By using a neural network to replace the condition, and 
calculate the matching and action of a classifier, we increase 
generalisation by allowing the classifier to advocate hetero- 
geneous actions in different regions of the problem subspace. 
The input state, s_t, is used by the classifier in two regards:
(i) to calculate whether the classifier matches in the problem
subspace, and determine the action it advocates if it does
match, and (ii) to calculate prediction values for the
classifier in a given problem subspace in which it matches.

For all experiments considered here, the outputs of the output
neurons (real-valued numbers in the case of MLPs, spike trains
in the case of SNNs) must be discretised into having either
high or low activation for the purposes of action calculation.
Each network comprises a problem-dependent number of input
neurons, a number of hidden layer neurons under evolutionary
control (see Sect. 5.1), and three output neurons. The first
two output neurons are used to calculate the action.
Similarly to Bull (2002), the final output neuron is a "don't
match" neuron that excludes the classifier from the match set
if it is highly activated. This is necessary as the action of
the classifier must be re-calculated for each state the
classifier encounters, i.e. a classifier may advocate different
actions in different regions of state space. Covering is altered
to repeatedly generate random networks until the network action
matches the desired output for a given input state. Fig. 1
shows how a neural classifier relates to a traditional
classifier condition and action.



Fig. 1: Detailing the mapping between a neural classifier
and a traditional classifier. A legal connectivity pattern for
an SNN classifier is shown. Figure labels: hidden layer topology
under evolutionary control; output action; "don't match" node.

4.1 MLP Classifiers

The MLP network is a highly abstracted neural model that
can produce real-valued outputs from real-valued inputs. Each
neuron uses a sum-over-inputs, constrained by a sigmoid
function, to produce an output value. Each classifier
condition/action pair is represented by a vector that is realized
as a feed-forward MLP. Each weight in this condition is
uniformly initialized randomly in the range [-1, 1] so that a
node can have both positive and negative postsynaptic
connections. Networks are initially fully-connected. MLP
networks cannot have recurrent connections, nor can they have
connections within the same layer. High activation is > 0.5,
low is < 0.5.
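
A minimal sketch of how such an MLP classifier could compute matching and action is given below. It assumes a single hidden layer, no bias terms (none are mentioned in the text), and nested lists as the weight layout; all function names are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def random_layer(n_out, n_in, rng):
    """Fully connected layer with weights drawn uniformly from [-1, 1]."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]


def mlp_outputs(state, w_hidden, w_out):
    """Feed-forward pass: inputs -> sigmoid hidden layer -> three sigmoid outputs."""
    hidden = [sigmoid(sum(w * s for w, s in zip(row, state))) for row in w_hidden]
    return [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w_out]


def match_and_action(outputs, threshold=0.5):
    """Return None if the "don't match" neuron is high, else the two action bits."""
    first, second, dont_match = (o > threshold for o in outputs)
    if dont_match:
        return None
    return first, second


rng = random.Random(42)
w_hidden = random_layer(1, 2, rng)   # one hidden neuron, two inputs (x, y)
w_out = random_layer(3, 1, rng)      # two action neurons plus "don't match"
print(match_and_action(mlp_outputs((0.25, 0.25), w_hidden, w_out)))
```

Because the outputs are recomputed for every input state, the same network may match in one region and not in another, or advocate different actions in each.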



4.2 SNN Classifiers 

Spiking neural networks present a biologically plausible
phenomenological model of information processing in the brain.
Mathematical models of the most basic current spiking
neuron, the Leaky Integrate and Fire (LIF) neuron, can be traced
back to as early as 1907 (Lapicque, 1907). We base our
spiking implementation on the LIF model, although it must
be stressed that our model is heavily simplified in terms of
the number of simulation steps used per action calculation.
Newly initialized networks are fully connected, and both
recurrent connections and hidden-to-hidden layer connections
are legal. Each connection weight is uniformly initialized
randomly in the range [0, 1] and cannot become negative
via GA action, as the type of node determines the nature
(positive or negative) of the connection. Hidden layer nodes
are initially excitatory with 50% probability, otherwise they
are inhibitory. All input and output layer nodes are excitatory.
Action calculation involves the current input state being
processed five times by each network. Each output neuron has
an activation window that records the spike train produced
by that neuron over these five steps. To calculate actions,
we classify an output spike train as highly activated if it has
spikes in over half of the positions in the sampling window
after five runs (3-5 spikes); otherwise it is said to have low
activation.

Neurons can be stimulated either by an external current
or by spikes from other connected neurons. Each neuron has
an internal state (or membrane potential, m) which acts as
a measure of its current activation level. As spikes are
received by the neuron, the value of m is altered. If m
surpasses a threshold, m_thresh, the neuron sends a spike with
value 1 to all connected neurons, which then receive a reduced
current relative to the connection weight between the two
neurons. A leak constant is used to gradually reduce m
over time (m is always > 0). The update of the membrane
potential of a neuron at time t is given in equation 2:



$$ m(t+1) = m(t) + (I + a - b \, m(t)) \qquad (2) $$
$$ \text{if } m(t) > m_{thresh} \text{ then } m(t) = c \qquad (3) $$



Equation 3 shows the reset formula. Here, m(t) is the
membrane potential at time t, I is the input current to the
neuron, a is a positive constant, b is the degradation (leak)
constant and c is the reset membrane potential of the
neuron. Normally, spiking neurons have their initial membrane
potentials m set to the membrane reset value c. We
employ a bootstrapping mechanism whereby the initial
membrane potential of every neuron in each network is set to
c_ini rather than c, where c_ini > c. This allows the networks
to achieve their first spike more quickly, requiring fewer setup
cycles as the networks begin to respond more strongly to the
inputs. This in turn allows us to (i) use smaller window sizes
at the output neurons, a performance-enhancing measure, and
(ii) benefit from faster general emergence of
temporally-sensitive activation patterns within the networks.
Note that after an initial spike, neurons are reset to c and
not c_ini; bootstrapping only affects the time to first spike
of each neuron in the network.
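
The neuron update and the effect of bootstrapping can be illustrated with a single isolated neuron driven by a constant input current. This is a sketch under stated assumptions: the threshold test is applied immediately after each update, synaptic input from other neurons is omitted, and the default parameter values are those listed for the experiments later in the paper (a=0.3, b=0.05, c=0, c_ini=0.5, m_thresh=1.0, window size 5). The function names are hypothetical.

```python
def simulate_lif(current, steps=5, a=0.3, b=0.05, c=0.0, c_ini=0.5, m_thresh=1.0):
    """Return the spike train of one LIF neuron under a constant input current."""
    m = c_ini                           # bootstrapped initial membrane potential
    spikes = []
    for _ in range(steps):
        m = m + (current + a - b * m)   # leaky integration of the input
        if m > m_thresh:                # threshold crossed: spike, then reset to c
            spikes.append(1)
            m = c
        else:
            spikes.append(0)
    return spikes


def is_high(spikes):
    """High activation: spikes in over half of the window positions."""
    return sum(spikes) > len(spikes) / 2


print(simulate_lif(1.0), is_high(simulate_lif(1.0)))   # strongly driven neuron
print(simulate_lif(0.0))                               # bootstrapped, no input
print(simulate_lif(0.0, c_ini=0.0))                    # same neuron without bootstrapping
```

With no input current, the bootstrapped neuron in this sketch emits its first spike on the second step, whereas the neuron started at c needs four steps, which is the motivation given above for using a short output window.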



4.3 Discovery Component

In N-XCSF, the GA is modified to be a two-stage process.
Stage 1, detailed in the following paragraph, controls the
rates of mutation and constructivism/connection selection
that occur within the system, with stage 2 (Sect. 5)
controlling the evolution of neural architecture in terms of both
neurons and connections. In N-XCSF we use mutation
exclusively to explore connection weight space; crossover is
omitted as experimental evidence thus far suggests that, in
neural classifier systems (Howard et al, 2008), sufficient
solution space exploration can be obtained via a combination
of self-adaptive connection weight and topology mutations
(see also (Rocha et al, 2007)). However, we note that a
number of suitable neural crossover operators have previously
been presented (e.g. by Stanley and Miikkulainen (2002)).
A GA is periodically triggered in [A] to evolve fitter
classifiers in an environmental niche. We utilise self-adaptive
mutation rates as in (Bull et al, 2000), to dynamically
control the amount of genetic search, or frequency of mutation
events, taking place within the niche. This provides stability
to areas of the problem space that are already "solved" as the
mutation rate for a niche is typically directly proportional to
its distance from the goal state during learning;
generalization learning, along with the value function learning,
occurs faster nearer the goal state. Here, the μ value (rate of
mutation per allele) of each classifier is initialized uniformly
randomly in the range [0,1]. During a GA cycle, a parent's μ
value is modified as in equation 4; the offspring then adopts
this new μ, with (0 < μ < 1), mutates itself by this value,
and is inserted into the population:

$$ \mu \rightarrow \mu \, e^{N(0,1)} \qquad (4) $$

5 Topology Mechanisms for Neural Networks

A classic problem in neural networks revolves around
network topology considerations: how many neurons should a
network consist of? How should we configure their
topological arrangement and inter-neural connectivity patterns to
ensure acceptable performance? In addition to self-adaptive
mutation, in this work two evolutionary topology
morphology schemes are applied to allow the modification of the
spiking networks in two regards: by adding/removing
hidden layer neurons, and by adding/removing inter-neural
connections. This allows each classifier to control its own
knowledge representation autonomously in terms of both
frequency and range of mutation that takes place in a given
niche at a given time (Bull et al, 2000), and by adapting the
hidden layer topology of the neural networks to reflect the
complexity of the problem sub-space considered by the network
(Bull, 2002).


5.1 Constructivism 



Constructivist learning (e.g., Quartz and Sejnowski (1997))
postulates that neural structures are initially small and
sparsely connected. Learning acts to add appropriate structure,
in the form of neurons and connections, until some satisfactory
level of computing power is attained; suitable specialized
neural structures emerge as a result of the learner's
interaction with its environment. The implementation of
constructivism in this system is based on the aforementioned work
in neural LCS (Bull, 2002). Each rule has a varying number
of hidden layer neurons (initially 1, and always > 0), with
neurons being added to or removed from the single
hidden layer depending on the constructivism element of the
system.

Constructivism takes place during a GA cycle, after
mutation. Two new self-adaptive parameters, ψ (0 < ψ < 1)
and ω (0 < ω < 1), are added. Here, ψ represents the
probability of performing a constructivism event and ω is the
probability of adding a neuron, with removal occurring with
probability (1 - ω). As with self-adaptive mutation, both are
initially randomly generated uniformly in the range [0,1],
and offspring classifiers have their parents' ψ and ω values
modified during reproduction as with μ. SNN nodes
created during constructivism are initially excitatory with 50%
probability sampled from a uniform distribution, otherwise
they are inhibitory. We have previously shown the utility of
this approach in contrast to using fixed-size networks (e.g.
Howard et al (2008)).



5.2 Connection Selection 

Feature selection (FS) is a method of reducing the number
of the data inputs to a process by selecting and operating
exclusively on a subset of those inputs. A popular FS variant
in the machine learning community is automatic FS.
Automatic FS includes both wrapper approaches (where
feature subsets can be tailored during the running of the
algorithm, e.g. Kohavi and John (1997)) and filter approaches
(where subset selection is a pre-processing step and
subsets are immutable during the running of the algorithm, e.g.
Koller and Sahami (1996)). The use of FS in a neural
context corresponds to enabling/disabling connections between
the input and hidden layer, and brings two major benefits.
Firstly, the amount of data being input to a process can be
reduced (increasing computational efficiency); secondly, noisy
connections (or those otherwise inhibitory to the successful
performance of the system) can be disabled. The
connection structure of artificial neural networks was first
evolved by Dolan and Dyer (1987). A comparative
summary can be found in Schlessinger et al (2005), where
many neuro-evolution methodologies are compared by the
authors. Explicit FS within LCS has been investigated by
Bull and Adamatzky (2007) and Bacardit et al (2007).

In this paper we allow any connection to be individually
enabled/disabled, a mechanism termed "Connection
Selection". Connection selection is implemented in our system
as follows: during a GA cycle, and based on a new
self-adaptive parameter τ (0 < τ < 1) (which is initialized and
self-adapted in the same manner as the other parameters), an
enabled connection can be disabled, or vice versa. If a
connection is enabled, its connection weight is randomly
initialised uniformly in the range [0, 1]. All connections are
initially enabled for new classifiers and classifiers created
via cover. During a node addition event, new connections
are enabled probabilistically, with P(connection enabled) =
0.5, with connection weight randomly set uniformly in the
range [0,1] as before. We have previously shown the
utility of this approach for reducing network complexity (e.g.,
Howard and Bull (2008)).



For clarity, we now summarise the steps involved in a
GA cycle. First, two offspring networks are created from the
selected parents. The self-adaptive parameters for those
networks are altered as in equation 4. Connection weights are
altered based on μ, then node addition/removal takes place
based on ψ and ω. Finally, connections are enabled or disabled
based on τ. These networks are inserted into the population and
networks are deleted as required.
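
The GA cycle summarised above can be sketched for a single offspring as follows. This is an illustration under several assumptions that are not stated in the text: the self-adaptive update takes the multiplicative form value * exp(N(0,1)); rates are clamped to (0, 1]; a mutated weight is perturbed by a uniform value in [-0.1, 0.1] and kept in [0, 1]; and the network is reduced to a flat list of connections with a hypothetical fixed number `PER_NODE` of connections per hidden neuron.

```python
import copy
import math
import random
from dataclasses import dataclass

PER_NODE = 3  # hypothetical: number of connections owned by one hidden neuron


@dataclass
class Net:
    weights: list        # connection weights, each in [0, 1]
    enabled: list        # one enabled/disabled flag per connection
    hidden: int = 1      # number of hidden layer neurons (always > 0)
    mu: float = 0.5      # per-allele mutation rate
    psi: float = 0.5     # probability of a constructivism event
    omega: float = 0.5   # probability that the event adds a neuron
    tau: float = 0.5     # per-connection enable/disable probability


def self_adapt(value, rng):
    """value * exp(N(0, 1)); the clamp to (0, 1] is an assumption."""
    return min(1.0, max(1e-6, value * math.exp(rng.gauss(0.0, 1.0))))


def make_offspring(parent, rng):
    child = copy.deepcopy(parent)
    # 1. Self-adapt the four rates inherited from the parent.
    child.mu = self_adapt(parent.mu, rng)
    child.psi = self_adapt(parent.psi, rng)
    child.omega = self_adapt(parent.omega, rng)
    child.tau = self_adapt(parent.tau, rng)
    # 2. Mutate each connection weight with probability mu (step size assumed).
    for i in range(len(child.weights)):
        if rng.random() < child.mu:
            w = child.weights[i] + rng.uniform(-0.1, 0.1)
            child.weights[i] = min(1.0, max(0.0, w))
    # 3. Constructivism: with probability psi, add (omega) or remove a neuron.
    if rng.random() < child.psi:
        if rng.random() < child.omega:
            child.hidden += 1
            for _ in range(PER_NODE):
                child.weights.append(rng.uniform(0.0, 1.0))
                child.enabled.append(rng.random() < 0.5)
        elif child.hidden > 1:           # never remove the last hidden neuron
            child.hidden -= 1
            del child.weights[-PER_NODE:]
            del child.enabled[-PER_NODE:]
    # 4. Connection selection: flip each connection with probability tau.
    for i in range(len(child.enabled)):
        if rng.random() < child.tau:
            child.enabled[i] = not child.enabled[i]
            if child.enabled[i]:         # re-enabled connections get a fresh weight
                child.weights[i] = rng.uniform(0.0, 1.0)
    return child


rng = random.Random(0)
parent = Net(weights=[0.5, 0.5, 0.5], enabled=[True, True, True])
child = make_offspring(parent, rng)
print(child.hidden, child.mu, child.enabled)
```

Insertion into the population and deletion are omitted; they proceed as in XCSF.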



6 Experiments in Continuous State Space 

Experiments were conducted on the continuous 2-D grid
world (Boyan and Moore, 1995), a standard continuous
Reinforcement Learning (RL) (Sutton and Barto, 1998) test
problem. We compare to our earlier work using an MLP-based
N-XCSF (see Howard et al (2008); Howard et al (2009) for
details). Each experiment comprised a number of trials.
A trial was defined as starting when the agent was
initially positioned randomly in the environment, consisting of
a number of [M] and [A] formations as the agent navigated
the environment, and finishing either with the agent
reaching the goal state and receiving reward, or with the trial
timing out after the agent had moved 200 steps (a time-saving
measure). Each trial was either in exploration mode (roulette
wheel action selection, see e.g. Howard et al (2008) for
reasoning) or exploitation mode (deterministic action selection).



6.1 Continuous Grid World 

In the Grid World, the agent's current state, s_t, is defined
as its (x,y) position, each coordinate bounded in the range
[0,1]; any movement outside of this range takes the agent to the
nearest grid boundary. To increase environmental complexity, and
to emulate the sensory noise encountered in the real world,
both the perceived x and y position of the agent are subject to
noise; +/- [0%-5%] of the actual position. The agent moves
a predetermined step size (0.05) in one of four directions
(North, East, South, West). The goal state is in the top-right
hand corner of the grid where (x + y > 1.90). The agent is
initially placed anywhere except the goal, and must navigate
to the goal in the fewest possible steps (the average optimum
is 18.6 steps), whereupon it receives an immediate reward of
1000 and the next trial begins. All other actions give an
immediate reward of 0. The environmental discount rate, γ, is
set to 0.95.
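
The environment dynamics described above can be sketched as follows. The sketch assumes that North increases y and East increases x (consistent with a goal in the top-right corner), and that the sensory noise is drawn uniformly and applied multiplicatively to each coordinate; neither detail is stated explicitly in the text. All names are hypothetical.

```python
import random

STEP = 0.05
MOVES = {
    "North": (0.0, STEP),
    "East": (STEP, 0.0),
    "South": (0.0, -STEP),
    "West": (-STEP, 0.0),
}


def clamp(v):
    """Keep a coordinate inside the grid bounds [0, 1]."""
    return min(1.0, max(0.0, v))


def step(pos, action):
    """Apply one move; return (new_position, reward, reached_goal)."""
    dx, dy = MOVES[action]
    x, y = clamp(pos[0] + dx), clamp(pos[1] + dy)
    at_goal = x + y > 1.90
    return (x, y), (1000.0 if at_goal else 0.0), at_goal


def perceive(pos, rng, max_noise=0.05):
    """Noisy observation: each coordinate perturbed by up to +/- 5% of its value."""
    return tuple(v * (1.0 + rng.uniform(-max_noise, max_noise)) for v in pos)


rng = random.Random(1)
pos, reward, done = step((0.95, 0.95), "North")
print(pos, reward, done)
print(perceive((0.25, 0.25), rng))
```

The classifiers receive the output of `perceive`, while the goal test and reward are computed from the true position.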

Each experiment consisted of 20,000 trials, 10,000 explore and
10,000 exploit. Experiments were repeated ten times, with the
results being the mean of these 10 runs. At the end of each
exploitation trial we performed an additional exploitation trial
from an arbitrary fixed location far from the goal state
(0.25, 0.25) to determine when the system reached a stable
solution. This location is also subject to noise; +/- [0%-5%] of
the actual agent position. We defined the system to be stable if,
for 50 consecutive trials, it provided an optimal steps-to-goal
value when starting from the selected location. That measure of
stability allowed us to perform standard t-tests to compare the
respective performances of the two neural representations.
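
The stability criterion can be illustrated with a short sketch. The function name `trials_to_stability` is hypothetical, and two details are assumptions: that "optimal" means a steps-to-goal value no greater than the known optimum from the fixed start location, and that the reported time is the trial at which the 50th consecutive optimal result occurs.

```python
def trials_to_stability(steps_to_goal, optimal, window=50):
    """Return the 1-based trial at which `window` consecutive optimal
    results have been observed, or None if that never happens."""
    run = 0
    for index, steps in enumerate(steps_to_goal):
        # Any non-optimal trial resets the count of consecutive successes.
        run = run + 1 if steps <= optimal else 0
        if run == window:
            return index + 1
    return None
```

Applying this to each of the ten runs gives one stability value per run, which is the kind of sample a two-sample t-test would then compare across network types.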

The current state of the system was analyzed every 50 trials and
used to create the figures. In the "steps-to-goal" figures, the
dashed line indicates optimal performance; all averages are taken
over [P]. In the following tables, "Stability" is the average time
the system takes to reach stability. "Connectivity" is defined as
the average percentage of enabled connections per classifier in
[P]. "Mutation" is the average final self-adaptive μ parameter in
[P]. "Neurons" is the average final number of connected hidden
layer neurons per classifier in [P]. "Macroclassifiers" shows the
final number of macroclassifiers contained in [P].

Action calculation proceeds as follows: the outputs at the two
output neurons (not the "don't match" node) are mapped to a single
discrete movement in one of four compass directions (North, East,
South, West). Each output falls into one of only two discrete
ranges, low or high, so the two outputs together encode the four
possible directions. The combined outputs translate to a discrete
movement according to the two motor output strengths:
(high, high) = North, (high, low) = East, (low, high) = South,
and (low, low) = West.
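
The mapping above amounts to a four-entry lookup table. In the sketch below, how an output neuron is classed as "high" or "low" is left outside the function (the booleans are supplied by the caller), since the classification rule is not restated here; `decode_action` is a hypothetical name.

```python
# Keys are (first output is high, second output is high).
ACTION_TABLE = {
    (True, True): "North",
    (True, False): "East",
    (False, True): "South",
    (False, False): "West",
}


def decode_action(first_high, second_high):
    """Translate the two output ranges into one compass direction."""
    return ACTION_TABLE[(bool(first_high), bool(second_high))]
```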

Spiking N-XCSF was parameterized as follows: population size
N=20000, learning rate β=0.2 (0 < β < 1), accuracy threshold
ε_0=0.005, GA threshold θ_GA=50, deletion threshold θ_DEL=50,
XCSF constant x_0=1, XCSF learning rate η=0.2 (0 < η < 1). All
other XCSF parameters follow (Wilson, 2001a). Spiking parameters
are a=0.3, b=0.05, c=0, c_ini=0.5, m_thresh=1.0 and output window
size=5. All networks initially have a single hidden layer neuron.

Table 1 shows t-test results when comparing the two network
representations in solving the continuous grid world. The first
row shows that time to stability is not statistically significantly
affected by network type (p-value 0.13), although the spiking
version does have a lower mean (compare also Figs. 3(a) and 2(a)).
The internal problem representation, in terms of both self-adaptive
mutation rate (p=8.6 × 10^-10) and number of neurons added via
constructivism (p=1.5 × 10^-4), is formed in a statistically
significantly different manner. As is typical in our constructivist
approach, we note context-sensitive structures being formed during
the learning process, in two main ways: (i) where the context is
the network type, with certain topological arrangements being
favoured by either SNN or MLP networks and appearing more
frequently in the final solutions; (ii) where the context is the
spatial position, with the evolution of similar "processing groups"
of neurons at certain locations in the problem space, within
solutions of the same network



Table 1: Detailing t-test results in MLP and spiking versions
of N-XCSF in the continuous grid world

Metric            Network type   Average     P-value
Stability         MLP            11453.5
                  Spiking        8280.5      0.13
Mutation          MLP            0.45
                  Spiking        0.108       8.6 × 10^-10
Neurons           MLP            2.01
                  Spiking        4.623       1.5 × 10^-4
Connectivity      MLP            84.95
                  Spiking        72.33       5 × 10^-4
Macroclassifiers  MLP            11237.3
                  Spiking        3006.9      1.08 × 10^-6



type. Perhaps the most striking result is that the number of
macroclassifiers used is significantly lower in the SNN case
(p=1.08 × 10^-6), indicating that spiking networks are capable
of generating more compact overall solutions.

Although more nodes are required for a spiking representation
(Fig. 3(b) shows 2.01 connected nodes, Fig. 2(b) shows 4.62
connected nodes), the spiking networks are less connected
(p=5 × 10^-4; Fig. 3(d) shows 85% connectivity, Fig. 2(d) shows
72%). Self-adaptive parameter values are lower in the spiking case
(Fig. 3(c) values range between 0.3 and 0.43, Fig. 2(c) values
range between 0.1 and 0.32), especially the three parameters
concerned with actual rates of mutation (e.g. μ, τ). These
universally lower mutation rates indicate that a more stable
solution is being evolved by the spiking networks, in that the
classifiers used in the final solution have less chance of
mutation, and less allele variation per mutation event, in the SNN
case. The more uniform curvature observed in Figs. 2(b) and 2(c),
as opposed to Figs. 3(b) and 3(c), indicates that not only is the
final solution more stable, but the evolution process itself is
made more stable. Spiking performance seems to compare favourably
to an interval representation (e.g. (Lanzi et al, 2005a), without
noise), especially since the spiking networks produce compact
solutions in the presence of sensory noise.

We additionally compared spiking N-XCSF to a tabular Q-learner
(as in (Lanzi et al, 2005b) for XCSF) with discount rate γ = 0.99.
We performed the optimal spatial discretisation (in terms of
Q-learner performance) as given in (Loiacono, 2010) to discretise
the continuous space into a 21 x 21 grid for the x and y
coordinates that comprise s_t, as shown in equation 5. The purpose
of this experiment, as well as the experiment presented towards
the end of Sect. 8, is to provide an indicator of the performance
of our system against a baseline reinforcement learner.



$$\mathit{disc}_x = (\mathrm{int})\left(\mathrm{floor}\left(\mathit{cont}_x \times \frac{1}{\mathit{step\ size}}\right)\right) \qquad (5)$$
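
A minimal sketch of the baseline follows: equation 5 as a function, plus a standard tabular Q-learning update. The learning rate `alpha=0.2`, the dictionary-of-lists table layout and the function names are assumptions for illustration; only the discretisation formula and γ = 0.99 come from the text.

```python
import math
from collections import defaultdict


def discretise(cont, step_size=0.05):
    """Equation 5: map a continuous coordinate in [0, 1] to a grid index."""
    return int(math.floor(cont * (1.0 / step_size)))


def q_update(q, state, action, reward, next_state,
             alpha=0.2, gamma=0.99, terminal=False):
    """One tabular Q-learning step (alpha is a hypothetical value)."""
    best_next = 0.0 if terminal else max(q[next_state])
    q[state][action] += alpha * (reward + gamma * best_next - q[state][action])


# One Q-value per compass direction for every visited grid cell.
q_table = defaultdict(lambda: [0.0, 0.0, 0.0, 0.0])
```

With a step size of 0.05, coordinates in [0, 1] map to indices 0 to 20, which is where the 21 x 21 grid comes from.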



Q-learning rapidly achieves optimal performance within 500 trials.
This result significantly outperforms our spiking N-XCSF
(p=3.2 × 10^-3, with an average time to stability of 8280.5 in the
spiking N-XCSF case, compared to 82.6 for the tabular Q-learner).
However, Q-learning requires a suitable discretisation of state
space to be decided on beforehand. The Q-value learned is identical
to the payoff value learned in XCSF, the difference being the use
of function approximation in the latter.

Fig. 2: Continuous grid world (a) Steps to goal, (b) average connected hidden layer nodes, (c) average self-adaptive parameter
values, (d) average enabled connections in spiking N-XCSF

Fig. 3: Continuous grid world (a) Steps to goal, (b) average connected hidden layer nodes, (c) average self-adaptive parameter
values, (d) average enabled connections in MLP N-XCSF

7 Taking Time into Consideration 

Reinforcement learning methods typically assign a value to each
possible state-action combination of a given task. When a
programmer is prepared or able to define the state space
discretisation a priori, this methodology has been proven to work
for robot systems (Mahadevan and Connell, 1992). The approach is
however labour intensive and becomes less tractable when the state
space complexity increases. There are numerous accepted methods by
which generalization of state spaces can be achieved, two of the
most popular being gradient descent methods and linear
approximation (specifically tile coding) (e.g., (Lin, 1991),
(Tham, 1995)). Santamaria et al (1998) extended the latter approach
by considering continuous-duration action spaces. Within LCS, the
approach has traditionally been either to predefine the temporal
duration of an action (Dorigo and Colombetti, 1994) or to predefine
the amount by which a sensor reading must change before a change in
state is said to have occurred (Cliff and Ross, 1994).

The motivation for adding actions of a continuous duration is to
more closely bridge the gap between simulation and physical
implementation; an agent samples its sensors with low latency, and
reacts accordingly. Our goal is to go some way towards modelling
these rapid sensory updates by facilitating them in our learning
architecture. In this case, we employ a system whereby an action
set can control the agent for more than one continuous-duration
action, each of which can be comprised of many discrete actions,
updating the computed action for the classifiers in [A] as new
environmental states present themselves. The LCS can therefore
create high-level continuous macro actions from chains of
lower-level actions. As the LCS can search the space of possible
macro actions, the need to predefine such actions is removed.

Our system is based on the Temporal LCS (TCS), which has been used
with both ZCS (originally (Wilson, 1994), implemented in
(Hurst et al, 2002)) and, more recently, with XCS (Studley and
Bull, 2005) and with XCSF and MLPs (Howard et al, 2009). In TCS,
the match set [M] and the action set [A] are formed as usual.
Subsequent input states are then fed into [A] as the agent
traverses the environment, without reforming [M] as in normal LCS.
If all classifiers in [A] still match the input, the advocated
action is taken and the next input state retrieved. If no
classifiers in the current [A] match the newly presented input
state, or a timeout limit since the [M] formation is reached, the
action is dropped (with reinforcement and GA activity) and a new
match set formed as normal. If only some classifiers match, [A] is
split into two new sets, [C] (the continue set) and [D] (the drop
set), so that each classifier has dual-set membership; [A] and
either [C] (if it still matches) or [D] (if it does not). Roulette
wheel selection based on fitness-weighted predicted payoff is then
used to pick a classifier from [A], and its membership of either
[C] or [D] is used to determine whether the system continues or
drops the current action.

- If [C] wins we continue, removing [D] classifiers from [A].

- If [D] wins we remove [C] classifiers from [A], perform
necessary parameter updates, then drop [A] and form a new [M].

As we continually calculate actions in [A], some classifiers may
advocate different actions to the one originally used to comprise
[A] from [M]. In this case, a "winning" action is picked from [A]
based on the action selection policy of the trial (roulette during
exploration, deterministic during exploitation). All classifiers
that do not advocate the newly-selected action are removed from [A]
but not added to the drop set; as the outcome of using their
advocated action is not explored, an accurate prediction value
cannot be ascertained. Traditionally, reinforcement in LCS is given
with the formula in equation 6. Here, r is the immediate reward, γ
is the discount factor and maxP is the maximum of the prediction
array. The reinforcement update is altered to consider the amount
of time an [A] maintains control of the LCS and the global time
taken to achieve reward:

$$P = r + \gamma \times maxP \qquad (6)$$

$$P = (e^{-\phi t^{t}})\, r + (e^{-\rho t^{i}}) \times maxP \qquad (7)$$

Equation 7 shows the TCS reinforcement formula; the first and
second reward factors, e^{-φ t^t} and e^{-ρ t^i}, favour efficient
overall solutions (in terms of the overall number of discrete
actions) and efficient state transitions (between
continuous-duration actions) respectively. φ and ρ are
experimentally-determined discounting factors, t^t is the total
number of steps taken in the trial, and t^i is the number of steps
taken since the last match set formation.
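
The difference between the two reinforcement targets (equations 6 and 7) can be sketched as below. The function and argument names are hypothetical; the code simply evaluates the two formulas, with `t_total` standing for the total number of steps taken in the trial and `t_since_match` for the number of steps since the last match set formation.

```python
import math


def standard_target(reward, max_p, gamma):
    """Equation 6: the traditional LCS reinforcement target."""
    return reward + gamma * max_p


def tcs_target(reward, max_p, t_total, t_since_match, phi, rho):
    """Equation 7: the TCS target, with two time-dependent discounts.

    exp(-phi * t_total) shrinks the reward as the whole trial gets longer;
    exp(-rho * t_since_match) shrinks the bootstrapped prediction as the
    current action set keeps control for longer.
    """
    reward_factor = math.exp(-phi * t_total)
    transition_factor = math.exp(-rho * t_since_match)
    return reward_factor * reward + transition_factor * max_p
```

Both factors equal 1 when the corresponding time is zero and decay towards 0 as it grows, so shorter trials and shorter-lived action sets receive larger targets.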



8 Experiments in Continuous Time and Space 

In the following experiments we demonstrate the ability of SNNs to
handle continuous time as well as continuous space. We validate our
approach on two well-known RL test problems, the Grid World
described in Sect. 6 and the "mountain-car" problem. It is
important to note that we only reset the network membrane
potentials (m) for every node once, at the beginning of each trial.
In this way, we hoped to exploit the temporal dynamics inherent to
spiking neural representations to solve these continuous time,
continuous space environments more effectively by preserving
network states between action set formations. In contrast to the
experiments in Sect. 6, the consideration of time classes the
problems as semi-MDPs.



8.1 Continuous Grid World 

In the Grid World, TCS parameters were set as: φ=0.45, ρ=0.005,
timeout=20. All other parameters were identical to those in
Sect. 6. As with the previous experiments involving the Grid World,
we compare our SNN TCS implementation to an MLP-based TCS. Table 2
describes the t-test results for SNN and MLP implementations,
showing no significant performance difference between the two
neural representations (p=0.57). Fig. 4(a) shows optimal
performance after 2000 trials for SNN, compared to 5000 trials in
the MLP case, which is shown in Fig. 5(a). No significant
differences are found in terms of connected hidden layer nodes
(p=0.51). Fig. 4(b) shows 2.4 connected hidden layer nodes, and
Fig. 5(b) shows just over 2 connected hidden layer nodes,
indicating that the SNN representation requires larger hidden layer
sizes on average. In terms of both mutation stability
(p=1.5 × 10^-4) and number of macroclassifiers (p=1.5 × 10^-3) the
spiking version held the advantage in a statistically significant
manner, attesting to the benefits of a SNN representation over an
MLP one. In terms of self-adaptive parameters, Fig. 4(c) values
ranged between 0.22 and 0.44 and Fig. 5(c) values ranged between
0.33 and 0.48; all self-adaptive parameters were lower in the
spiking case, again an indicator of improved stability in terms of
fewer mutation events on average in the evolved SNN classifiers.

Figs. 4(d) and 5(d) show the differences in connectivity between
SNN (86%) and MLP (84%) networks, revealing the spiking networks
to be more densely connected, although not statistically
significantly so (p=0.065). It is also interesting to note that
although the transition to TCS in the spiking case did not
statistically increase solution stability, it decreased the average
number of neurons required by the networks with a p-value of 0.016,
highlighting the ability of a SNN to produce temporally dynamic
activity, as fewer processing units were required when the
temporally-sensitive SNNs were coupled with a semi-MDP test
problem. This indicated that the SNNs were harnessing temporal
information contained in the test problem to solve it in a
different manner.

It was interesting to observe that the spiking networks seemed
more predisposed to allow for multiple actions to be selected
within a single match set formation, as opposed to the more
homogeneous action selection evidenced in MLP networks
(e.g. (Howard et al, 2008)). Fig. 6 shows heterogeneous action
selection in the spiking case allowing a single match set to
control the agent all the way to the goal state,



Table 2: Detailing t-test results in TCS-enabled MLP and
spiking versions of N-XCSF in the continuous grid world

Metric            Network type   Average     P-value
Stability         MLP            5687.5
                  Spiking        5637.67     0.57
Mutation          MLP            0.43
                  Spiking        0.25        1.5 × 10^-4
Neurons           MLP            2.09
                  Spiking        2.40        0.51
Connectivity      MLP            83.93
                  Spiking        86.13       0.065
Macroclassifiers  MLP            18148.1
                  Spiking        16552.25    1.5 × 10^-3



Fig. 6: Sample action discretisations in spiking and MLP
networks in the continuous grid world (plot of the MLP path and
the spike path against grid x value)



contrasting with the MLP solution of more homogeneous action
selection; first taking the agent to the rightmost border, then
progressing up the border to the goal state. In situations where
both network types provided heterogeneous actions from the same
match set, the spiking networks were more likely to sequentially
switch selected action from one step to the next. Fig. 6 shows
that, given the actions North, South, East, West in the continuous
grid world, the spiking representation gave (N, E, N, E, N, E...)
whereas the MLP representation provided (N, N, N... E, E, E...).
The more complex discretisations available to the spiking
representation give a clear benefit of using a spiking
representation over an MLP representation when considering the
prospective application of the system to more complex environments,
where highly varied, heterogeneous action selection may be required
for optimal performance.



8.2 Smaller Step Size Environment 

In preparation for a move to a simulated robotics platform, we
decreased the step size tenfold to 0.005. The motivation for this
experiment was to test the scalability of our temporal
representation, as a real robot would read sensors many times per
second. The new step size was more akin to the number of readings a
real robot may be required to make. It may also be necessary for
the agent to perform homogeneous action selection in extended
regions of the environment, followed by certain highly
heterogeneous areas where more complex behavioural policies, such
as obstacle avoidance, are required; TCS allows for discretisation
of state space based on required actions, as well as potentially
altering the action of a given [A] through multiple state
transitions in response to the required action transition
frequency. Finally, a smaller step size introduced an environment
where a TCS classifier system should be able to perform optimally,
whereas a traditional Q-learner struggled.

Fig. 4: Continuous grid world (a) Steps to goal, (b) average connected hidden layer nodes, (c) average self-adaptive parameter
values, (d) average enabled connections in spiking TCS N-XCSF

Fig. 5: Continuous grid world (a) Steps to goal, (b) average connected hidden layer nodes, (c) average self-adaptive parameter
values, (d) average enabled connections in MLP TCS N-XCSF

We increased the timeout value from 20 to 200, giving an optimal
average steps-to-goal value of 1.5. To demonstrate the scalability
of the system, all other LCS parameters are identical. Results of
comparisons between the spiking TCS-enabled systems within the step
size 0.05 and 0.005 environments are presented in Table 3 and
Figs. 4 and 7. Both systems solve the task optimally. However, the
performance of the system with step size 0.005 is statistically
better (p=2.42 × 10^-6). A possible explanation for this is that,
as the average number of discrete movements an agent is required to
make is much greater than in the step size 0.05 case, the SNN
networks have more opportunity to use the temporal information in
this semi-MDP, resulting in a performance difference. Fig. 7(a)
shows the performance of the system in the step size 0.005
environment. Commencing from an average steps-to-goal value of 5.8,
the system initially produces fluctuating results until around 2500
trials; stability is thereafter attained. Final solutions show that
more macroclassifiers are required in the smaller step size
environment in an almost statistically significant manner, p=0.03.
This indicates that more network variety is required in a more
granular environment. Both step sizes produce solutions with a
similar number of connected nodes, although the smaller step size
does induce a slight growth in node-wise network complexity (2.4
connected hidden nodes on average in Fig. 4(b), contrasting with
2.5 nodes on average in Fig. 7(b)); p=0.15. Again, the curvatures
are similar, but final values differ slightly. A possible
explanation for the increased number of hidden nodes is that, as
the step size 0.005 environment potentially provides more latent
temporal information per trial (as more discrete steps are required
on average), more hidden nodes (temporal processing units) are
evolved by the networks to usefully process this latent
information.

Immediately obvious were similar parametric trends appearing in
both sets of graphs. For example, the line curvatures for
self-adaptive parameters described in Figs. 4(c) and 7(c) are
similar in shape, and follow an identical descending order
(ft), fl, t), the only difference being that the parameter values
in Fig. 4(c) are slightly lower in general (reaching values of
0.44, 0.29, 0.25 and 0.22 respectively


Table 3: Detailing t-test results in TCS spiking N-XCSF in
the continuous grid world with step sizes 0.05 and 0.005

Metric            Step size   Average     P-value
Stability         0.05        5637.5
                  0.005       1244.13     2.42 × 10^-6
Mutation          0.05        0.25
                  0.005       0.35        0.01
Neurons           0.05        2.40
                  0.005       2.55        0.15
Connectivity      0.05        86.13
                  0.005       85.09       0.36
Macroclassifiers  0.05        16552.25
                  0.005       18384.13    0.03



with step size 0.05, compared with 0.48, 0.39, 0.36 and 0.31
in the step size 0.005 environment). Differences between the
self-adaptive parameter values in Fig. 7(c) can only be seen after
approximately 10000 trials. Table 3 reveals that the final average
self-adaptive mutation values vary statistically significantly
between the different step size environments, p=0.01. Fig. 4(d)
shows an average network connectivity of approximately 86%, whereas
Fig. 7(d) shows 81%, p=0.36. The number of macroclassifiers
required does not vary significantly between the trials (p=0.03).
Despite similar trends, differing values again highlight the
self-adaptive nature of the learning process, which alters
depending on the environment the system is presented with.

As in Sect. 6, we compared performance to that of a baseline
tabular Q-learner with step size 0.005 and discount rate γ = 0.99.
We performed the optimal performance-giving discretisation as given
in equation 8 to discretise the continuous space into a 201 x 201
grid based on this new step size. With a steps-to-goal value always
>400 (optimal 181.4, results not shown), Q-learning was unable to
solve this more challenging environment within the allotted
timeframe. The primary reason for this failure can be explained as
too fine a sampling granularity within the environment: the system
could not generate an accurate payoff map of the environment as
there were too many discount steps in an average trial to
successfully utilise Q-learning within a reasonable timeframe.
Hence we show the effectiveness of the temporal nature of TCS: the
ability to harness the temporal sensitivity of the SNN classifiers
to deal with long action chains (continuous actions) that
potentially consist of multiple discrete actions, with no need to
pre-discretise the continuous state space, in the presence of
sensory noise.
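
To give a feel for the scale of the problem the Q-learner faces, the sketch below compares the two grids. It counts table entries (cells times four actions) and evaluates how much of the goal reward survives discounting over the quoted optimal number of steps. Treating the discounted value as reward × γ^steps is a simplification made for illustration, and both function names are hypothetical.

```python
def q_table_entries(step_size, n_actions=4):
    """Number of state-action values for a grid built from this step size."""
    cells_per_axis = round(1.0 / step_size) + 1
    return cells_per_axis * cells_per_axis * n_actions


def discounted_goal_reward(reward, gamma, steps):
    """Approximate value of the goal reward seen from `steps` moves away."""
    return reward * gamma ** steps


coarse_entries = q_table_entries(0.05)    # 21 x 21 grid
fine_entries = q_table_entries(0.005)     # 201 x 201 grid
coarse_value = discounted_goal_reward(1000.0, 0.99, 18.6)
fine_value = discounted_goal_reward(1000.0, 0.99, 181.4)
```

The finer grid has roughly ninety times as many table entries to fill in, and the reward signal reaching a typical start state is much weaker, which is consistent with the explanation given above.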

It should also be noted that these experiments were carried out on
the spiking non-TCS and MLP TCS versions of our system (results not
shown); neither could find the optimal discretisation as evidenced
in the spiking TCS version, demonstrating the ability of SNNs to
harness temporal information and highlighting the suitability of
TCS-style functionality for a simulated robotics environment.

Fig. 7: Continuous grid world with step size 0.005 (a) Steps to goal, (b) average connected hidden layer nodes, (c) average
self-adaptive parameter values, (d) average enabled connections in spiking TCS N-XCSF



8.3 Mountain-Car 

To demonstrate the general learning ability of spiking TCS, we
also tested on the mountain-car problem (Sutton, 1996), in which a
car must be guided out of a one-dimensional valley. Reaching the
goal state is non-trivial as, in some cases, the car must move away
from the goal state to attain enough momentum to climb out of the
valley. State variables were position [-1.2, 0.6], and velocity
[-0.07, 0.07]. Three actions were available: forward (increase
velocity), backward (decrease velocity), and no movement. For a
complete algorithmic description see (Sutton, 1996). As there were
only three actions available, SNN outputs were discretised into
actions as: forward = high, high, backward = low, low, and no
movement = high, low or low, high. Each experiment consisted of
5000 trials, 2500 explore and 2500 exploit. The agent was initially
randomly placed in the environment (except in the goal state),
given a random velocity, and had to reach the goal state (where the
car's position is > 0.5) in the fewest possible steps (optimal
steps-to-goal = 1). Population size N=1000; TCS parameters were
φ=0.45, ρ=0.005, timeout=200. All other parameters were identical
to those in Sect. 6.
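
A minimal sketch of this task is given below. The state bounds, the goal test and the three-way action mapping follow the description above. The update constants (0.001 and 0.0025) and the rule that velocity is zeroed at the left boundary are assumptions taken from the commonly used formulation of the mountain-car problem, not values stated in this text.

```python
import math


def decode_action(first_high, second_high):
    """Map two output ranges to +1 (forward), -1 (backward) or 0 (none)."""
    if first_high and second_high:
        return 1
    if not first_high and not second_high:
        return -1
    return 0  # (high, low) and (low, high) both mean no movement


def step(position, velocity, action):
    """Advance the car by one step; return (position, velocity, at_goal)."""
    # Assumed constants from the commonly used formulation of the task.
    velocity += 0.001 * action - 0.0025 * math.cos(3.0 * position)
    velocity = min(0.07, max(-0.07, velocity))
    position += velocity
    if position < -1.2:
        # Assumption: hitting the left boundary stops the car.
        position, velocity = -1.2, 0.0
    position = min(0.6, position)
    return position, velocity, position > 0.5
```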

Fig. 8(a) shows attainment of optimum performance within 100
trials, which appears competitive with XCSF with a tile coding
(e.g. (Tham, 1995)) scheme (Lanzi et al, 2006). It should be noted
that (Lanzi et al, 2006) allows only two actions. Average neurons
per classifier (Fig. 8(b)) gradually increases to a final value of
1.65. Self-adaptive parameters decline from their initial values
(Fig. 8(c)). Network connectivity (Fig. 8(d)) steadily declines to
a final average value of 84.5%. Action set analysis shows that TCS
allows the agent to reach the goal state from any initial
position/velocity combination by forming only one match set.
Actions are altered from forward to backward as the agent builds
momentum; this is achieved by the networks recalculating their
actions based on sensory input to alter the system action. The TCS
reinforcement formula allows for the generation of shortest-length
action chains within these macro actions within 480 trials.



9 Moving to Robotics Problems 

At the beginning of Sect. 7, justification for the use of a
continuous test environment was summarised as a desire to more
accurately replicate the types of situations a real robot might
encounter. Here, we begin with an explanation for the transition
from the continuous maze to a physical robotics simulation.
Although a standard RL test scenario, the continuous grid world is
an environment lacking in any real-world complexity, selected
mainly to allow a direct comparison to tabular Q-learning.
Evaluation of our system on a robot simulation suite is therefore a
logical step forwards. An overview of current robot simulation
platforms can be found in (Craighead et al, 2007). Our chosen
development simulator is the Webots robotics platform
(Michel, 2004), a test bed chosen due to its popularity in the
research community.

Fig. 8: Mountain-car (a) Steps to goal, (b) average connected hidden layer nodes, (c) average self-adaptive parameter values,
(d) average enabled connections in spiking N-XCSF



9.1 LCS Robotics research 

The first LCS to successfully tackle the world of real robot
control was the work of Dorigo and Colombetti (1994). The authors
used a modified version of Holland's classifier system (Holland and
Reitman, 1978) to create a hierarchical LCS in which lower-level
LCSs can learn simple behaviours, which high-level LCSs then
coordinate to generate complex actions. MONALYSA (Donnart and
Meyer, 1996) was a hierarchical LCS in which the hierarchy itself
could be dynamically reconfigured. It was experimentally
demonstrated that the ability to hierarchically decompose the
required behaviour into sub-behaviours produced better performance,
firstly in a simulated environment and later on a physical robot.

Bonarini (1998) presented a fuzzy classifier system that created
sets of fuzzy niches which guided the behaviour of the agent, and
employed a delayed reinforcement attribution scheme similar to
Q-learning. The authors reported swift learning compared to the
traditional ternary classifier condition representation. Bonarini
and Trianni (2001) then extended this system for cooperation
amongst a swarm of agents who could explicitly pass messages
between themselves within a limited range. Another fuzzy classifier
system is introduced by Pipe and Carse (2002) for implementation on
a robotics platform. The system was tested on non-trivial maze
environments, which were navigated by a physical robot; compact
rule sets were reported as a benefit of fuzzy classifier
representation.

Latent learning was realized in a classifier system by Stolzmann
(1999), providing a degree of premonition. It was demonstrated that
his LCS could build chains of classifiers without waiting for
subsequent environmental inputs by creating its own internal
representation of the environment. Katagami and Yamada (2000)
presented a manual approach for inducing certain behavioural
patterns as a bootstrapping technique for learning in a real robot.
Reinforcement learning was used as normal to decide on the "best"
actions to take from the initially generated set. The operator
could also decide to take control, in which case the system created
rules that covered the actions that the operator took, although
this is obviously supervised learning. Webb et al (2003) reported
mixed success in XCS control over a simulated Khepera in Partially
Observable Markov Decision Process (POMDP) maze environments. They
attempted to bypass the "aliasing states" problem by augmenting
each classifier with an internal state register and internal
action, which were used to differentiate between aliasing states.
The LCS made use of temporal actions with TCS-like functionality. A
simple LCS is presented by Cazangi et al (2003), who evolved robot
controllers for goal location and object avoidance tasks in unknown
environments. The authors reported no significant degradation of
performance when switching from simulation to a physical agent and
also demonstrated that successful controllers can be evolved
entirely on the physical robot. More recently, Butz and Herbort
(2008) demonstrated an XCSF-derivative to control a robot arm using
a real interval representation. Recent research by Moioli et al
(2007) compared attempts at T-maze navigation in both simulated and
real robotics tasks, and presented results that reinforce the idea
that TCS is a valid method for dealing with long-action chains that
are present in many robotic environments. Similarly to Webb et al
(2003), the authors added a memory register and tested the system
on a non-reactive robotics task. Again, high performance and the
ability to disambiguate perceptually aliased states is reported. It
should be noted that TCS-based systems have previously been used
explicitly for real robot navigation, by Studley and Bull (2005)
and Hurst and Bull (2006).



9.2 Experimentation 




Fig. 9: (a)The test environment. The agent begins in the 
lower-left and must reach a light source (circle) in the upper- 
right, circumnavigating the central obstacle. An example 
agent path is shown (dotted line).(b)Khepera sensory ar- 
rangement. 3 light sensors and 3 IR sensors share positions 
0, 2 and 5. Two bump sensors, Bl and B2, are shown at- 
tached at 45 degree angles to the front-left and front-right of 
the robot. 



This final set of experiments demonstrates the performance
of the spiking TCS N-XCSF in a robotics environment (Fig. 9(a)).
The agent, a simulated Khepera II robot, was initially
randomly located within a walled arena which it could not leave,
with coordinates ranging from [-1,1] in both x and y
directions (all units are in metres). A light source was placed at
the top-right hand corner of the arena (x=1, y=1, z=1), which
the agent must approach in order to receive reward. The light
source was modelled on a 15W bulb with realistic
attenuation values. Adding to the complexity of the environment,
a three-dimensional box was placed centrally in the arena
(with vertices on "ground level" (z=0.0) at (x=-0.4, y=-0.4),
(-0.4, 0.4), (0.4, 0.4), and (0.4, -0.4), and raised to a height
of z=0.15). When the agent reached the reward zone (where
x + y > 1.6), an immediate reward of 1000 was returned and
the next trial began. All other movements gave an immediate
reward of 0. The reward boundary of 1.6 was chosen as the
result of calibration experiments, which revealed the agent
experienced excessive levels of sensory aliasing when
approaching the light. All IR and light sensor lookup tables
were modelled on actual sensor performance from the real
Khepera robot.

We altered the environmental representation used by the
agent to calculate action and predict payoff. Contrasting our
previous approach of using the agent's noisy (x,y) location in
the environment as the input state s_t, we used readings taken
from the agent's light and distance sensors to model the
current environmental state. The agent, a cylinder-shaped robot
with height 0.03 and diameter 0.07, was equipped
with 8 light sensors and 8 IR distance sensors (see Fig. 9(b)),
plus two bump sensors, offset to the left and right of the
front of the agent. At each step, the agent sampled its light
and IR sensors, whose scaled values ranged [0,1]. These
values then comprised the input state for the current step. After
Hurst and Bull (2006), six sensors were used to comprise
the input state, three IR and three light sensors at positions 0,
2 and 5 (see Fig. 9(b)). This extended input state was
thereafter used to calculate [M] membership/actions and compute
prediction as normal.
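
The construction of the six-element input state described above can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the function name build_input_state, the ordering of the elements (IR readings first, then light readings) and the defensive clamping are assumptions; the text specifies only that positions 0, 2 and 5 are used and that readings are scaled to [0,1].

```python
from typing import List, Sequence

# Sensor positions named in the text; both sensor types share them.
SENSOR_POSITIONS = (0, 2, 5)


def build_input_state(ir: Sequence[float], light: Sequence[float]) -> List[float]:
    """Return a six-element state: three IR then three light readings.

    The IR-before-light ordering is an assumption made for illustration.
    """
    if len(ir) != 8 or len(light) != 8:
        raise ValueError("expected 8 IR and 8 light readings")
    state = [ir[p] for p in SENSOR_POSITIONS] + [light[p] for p in SENSOR_POSITIONS]
    # Readings are already scaled to [0, 1]; clamp defensively (assumption).
    return [min(1.0, max(0.0, v)) for v in state]
```

Given the eight scaled readings of each type, the function returns the six values that would be presented to every network when forming [M].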

In this experiment we employed our spiking TCS N-XCSF
to solve a more complex version of the grid
environment introduced in Sect. 6.1. This environment was much
more challenging than any we have used previously. For
example, the agent's wheels could slip, sensory noise was more
pervasive and more accurately modelled, and the
environmental input was three times larger than that used in the
continuous grid world experiments. The inclusion of an obstacle
further increased this complexity. At the start of each trial
the agent's initial random position was further constrained
so that the obstacle was always initially between the agent and
the light source: the starting position was restricted to
the lower left-hand corner of the environment, where the
inequality (x + y < -1.5) is satisfied.
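
The geometric constraints of the task (arena bounds, reward zone, obstacle footprint and start region) can be summarised in a short sketch. The constants are taken from the text; the function names, the use of rejection sampling for the start position and the decision to ignore the robot's own radius are assumptions made purely for illustration.

```python
import random
from typing import Tuple

ARENA_MIN, ARENA_MAX = -1.0, 1.0   # arena spans [-1, 1] metres in x and y
GOAL_THRESHOLD = 1.6               # reward zone: x + y > 1.6
START_THRESHOLD = -1.5             # start region: x + y < -1.5
GOAL_REWARD = 1000.0
OBSTACLE_HALF_WIDTH = 0.4          # box footprint from (-0.4, -0.4) to (0.4, 0.4)


def reward(x: float, y: float) -> float:
    """Immediate reward: 1000 inside the reward zone, 0 everywhere else."""
    return GOAL_REWARD if x + y > GOAL_THRESHOLD else 0.0


def in_obstacle(x: float, y: float) -> bool:
    """True if the point lies within the footprint of the central box."""
    return abs(x) <= OBSTACLE_HALF_WIDTH and abs(y) <= OBSTACLE_HALF_WIDTH


def sample_start(rng: random.Random) -> Tuple[float, float]:
    """Rejection-sample a start position in the lower-left corner.

    Rejection sampling is an assumption; the text states only the
    inequality that the start position must satisfy.
    """
    while True:
        x = rng.uniform(ARENA_MIN, ARENA_MAX)
        y = rng.uniform(ARENA_MIN, ARENA_MAX)
        if x + y < START_THRESHOLD:
            return x, y
```

Because the start region and the reward zone lie in opposite corners, the straight line between any start position and the light source passes near the centre of the arena, which is where the box sits.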

To demonstrate the problem-independence of the
system, parameter settings were largely unchanged from those
in Sect. 8, with the following exceptions. First, N was reduced to 3000.
Due to time and processing constraints, each experiment
was limited to 500 trials. Secondly, each classifier was
initially seeded with 6 hidden layer nodes to offset the lack of
trials per experiment. Seeding the networks with one neuron
would make it extremely difficult for the classifier to
immediately and usefully discriminate certain state elements, as
it is assumed that providing 6 state elements instead of the
normal 2 would require the networks to make more complex
partitions in state space. To reduce disruption, self-adaptive
parameters were initially constrained to (0 < μ/ψ/τ < 0.02),
with (0 < ω < 1) as normal. Finally, the six hidden layer
nodes had each initial connection enabled with 50%
probability. This is intended to increase inter-network variation
early in the experiment, as certain state variables will
initially have no effect on the output action, thus increasing
behavioural diversity. Usually, connection selection performed
this operation gradually throughout the 20,000 trials. All of
these modifications were implemented to aid expediency;
as we completed only 500 trials, the new parameterisation
allowed the networks to perform useful functions
immediately. Due to the nature of constructivism, it is assumed that
the system would eventually be able to learn an optimal
policy for the given task, given enough time, when set to its
default parameters.
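
The modified seeding can be illustrated with a minimal sketch. It is not the authors' code: the class and function names, the number of output nodes (n_outputs = 3) and the use of uniform sampling are assumptions. Only the six inputs, the six hidden nodes, the 50% connection probability and the parameter ranges come from the text.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class SeededClassifier:
    mu: float      # self-adaptive parameters, initially below 0.02
    psi: float
    tau: float
    omega: float   # initialised in (0, 1) as normal
    input_to_hidden: List[List[bool]]    # enabled flags, [hidden][input]
    hidden_to_output: List[List[bool]]   # enabled flags, [output][hidden]


def seed_classifier(rng: random.Random, n_inputs: int = 6, n_hidden: int = 6,
                    n_outputs: int = 3, p_enabled: float = 0.5,
                    rate_cap: float = 0.02) -> SeededClassifier:
    """Seed one classifier; n_outputs is a hypothetical value."""
    def mask(rows: int, cols: int) -> List[List[bool]]:
        # Each connection is independently enabled with probability p_enabled.
        return [[rng.random() < p_enabled for _ in range(cols)]
                for _ in range(rows)]

    return SeededClassifier(
        mu=rng.uniform(0.0, rate_cap),
        psi=rng.uniform(0.0, rate_cap),
        tau=rng.uniform(0.0, rate_cap),
        omega=rng.random(),
        input_to_hidden=mask(n_hidden, n_inputs),
        hidden_to_output=mask(n_outputs, n_hidden),
    )
```

With roughly half of the connections disabled at random, two freshly seeded classifiers will typically ignore different subsets of the six state elements, which is the source of the early behavioural diversity described above.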

Movement values and sensory update delays were
constrained by accurate modelling of the physical Khepera agent. It
should be noted that in the earlier experiments, agent
orientation was irrelevant; here orientation was preserved through
movements and the agent was able to turn continuously to
explore the environment. Three actions were possible:
forward, and continuous turns to both the left and right (caused
by halving the left/right motor outputs respectively). As the
agent initially explored the environment, it was likely to
bump into obstacles. If either bumper was activated, an
interrupt was sent, causing the agent to reverse 10cm and form
a new [M] (after Hurst and Bull (2006)).
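
The action set and the bump interrupt can be sketched as a mapping from actions to motor commands. The names Action, StepCommand, motor_outputs and step_command, and the normalised base speed of 1.0, are hypothetical; the text specifies only that turns are produced by halving the corresponding motor output and that a bump triggers a 10cm reversal followed by a new [M].

```python
from enum import Enum
from typing import NamedTuple, Tuple


class Action(Enum):
    FORWARD = 0
    LEFT = 1
    RIGHT = 2


REVERSE_DISTANCE_M = 0.10  # the agent reverses 10cm after a bump


class StepCommand(NamedTuple):
    left_motor: float
    right_motor: float
    reverse_m: float        # distance to reverse before the next step
    reform_match_set: bool  # True when a new [M] must be formed


def motor_outputs(action: Action, base_speed: float = 1.0) -> Tuple[float, float]:
    """Halve the left motor to turn left and the right motor to turn right."""
    left = right = base_speed
    if action is Action.LEFT:
        left = base_speed / 2
    elif action is Action.RIGHT:
        right = base_speed / 2
    return left, right


def step_command(action: Action, bump_left: bool, bump_right: bool,
                 base_speed: float = 1.0) -> StepCommand:
    """A bump on either sensor interrupts the current action."""
    if bump_left or bump_right:
        return StepCommand(0.0, 0.0, REVERSE_DISTANCE_M, True)
    left, right = motor_outputs(action, base_speed)
    return StepCommand(left, right, 0.0, False)
```

Since every collision forces a new match set, trials containing collisions accumulate more [M] formations, which is the quantity reported as steps-to-goal in the results.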

Fig. 10(a) shows the steps-to-goal values attained. Starting
from 75 steps-to-goal, the system shows swift attainment
of near optimal performance in terms of number of match set 
formations per trial after 300 trials, especially considering 
the levels of sensory noise. It is demonstrated that the sys- 
tem could, in most cases, evolve networks flexible enough to 
perform segregations in action space without reforming [M] 
via the ability of the networks to recalculate actions within 
an action set due to differences in received states. 

An initial exploratory phase of action selection involved 
many collisions, which were progressively penalized as they 
caused more [M] formations per trial. The responsible clas- 
sifiers were progressively usurped by classifiers which re- 
tained the same initial [M] by avoiding collisions. Action 
alteration once [M] was formed was usually due to IR sen- 
sors activating and perturbing network performance to turn 
the agent away from an obstacle or towards the light source, 
although the agent was also observed to display the capability
to alter action based on light sensor input alone. During a
trial, action calculation was seen to be altered in two ways:
networks either evolved to activate the "don't match" node so
that classifiers advocating certain actions were dropped from 
[A] at the correct time, or state inputs caused the action ad- 
vocated by the majority of the classifiers in the current [A] 
to transition between two states, causing a turn. Networks 
were observed to use the latter method more frequently than 
the former (68% of action switches occurred using the lat- 
ter method), an evolved behaviour to keep more classifiers 
in [A] and hence retain more classifier variety in the action 
set. It was interesting to note that once the correct amount 
of turn was applied to the agent (e.g. in response to IR sen- 
sors highly activating, then deactivating once the obstacle 
was avoided), the action set was observed to re-calculate the 
"correct" action immediately; both methods outlined above 
were observed to be used by the system. Varying amounts 
of turn were observed (as more frequent turn actions inter- 
spersed within a majority of forward actions) as the agent 
approached an obstacle. 

The number of hidden layer nodes is shown in Fig. 10(b),
self-adaptive parameters in Fig. 10(c), and
connectivity in Fig. 10(d). All graphs show that the
initial parameter values are mainly unaltered, due to the
reduction in the number of trials. Performance was seen to
compare favourably to other TCS robotics experiments (e.g.
Hurst and Bull 2006). As mentioned previously, by
seeding each classifier with 6 hidden layer neurons, and allowing
any connection in the network to be initially disabled with 
50% probability, we effectively "jumped" the start of con- 
structive neuro-evolution, allowing the networks sufficient 
topological disparity and neural complexity to begin solv- 
ing this more complex problem immediately. 

Numerous differences existed between the simulation and 
traditional grid world experiments. Firstly, variations in terms 
of state space were seen to present a more difficult experi- 
ment for the simulated robot to navigate; the value function 
in the previous continuous grid world experiments increased 
smoothly as the agent neared the goal state. The value func- 
tion also mapped approximately to the sensory readings per- 
ceived by the agent so that higher values of x and y gave, in 
general, higher payoffs. Conversely, in simulation the robot 
had to deal with less uniform state space due to the short 
range (0.01) of the IR sensors, which could within a short 
amount of time drastically alter the state input to the
networks; hence the payoff map did not correlate with the agent's state
as simply. The environmental representation was more
complex; the light sensors were easily saturated, and the
environment was mainly well-illuminated; a necessary
implementation restriction due to using Webots to ensure that collision
detection occurs. It was noted that there was sparse use of 
high (>0.5) light sensor values that correspond to very dark 
regions of the environment. Also prevalent was a degree of
sensory noise surpassing the ±5% used in the
continuous grid world environment; IR sensors were ±5% at the
extremes of their range, light sensors ±10%.

Fig. 10: (a) Steps-to-goal (b) average connected hidden layer nodes (c) average self-adaptive parameter values (μ/ψ/τ
plotted on the right axis) (d) average enabled connections for TCS N-XCSF in the Webots experiment



10 Conclusions 



In this paper we have shown that a self-adaptive neural LCS
employing constructivism can perform optimally in a more
complex and noisy version of a standard continuous
environment, and in a continuous noisy robotic simulation. As the
networks can calculate their actions, they have the capability to
carry out the correct action in different areas of the problem
space, even if the action required differs between those areas.
Further, using the prediction computation of XCSF, we have observed that
one network can accurately predict payoff in several
spatially disparate regions of the problem space, even when the
payoff values are different.

The results of adding TCS showed that when the
system was presented with an environment which allowed it
to harness its temporal capabilities, its performance exceeded
that achieved in environments where a temporal element does not exist;
in other words, the system is capable of processing
underlying temporal information in a problem. The main strength of
the TCS approach was the generation of high-level
continuous actions from simple discrete actions, which allowed the
agent to traverse environments requiring long action-chains.
The LCS can also search the space of continuous actions,
removing the need to predefine such actions. By adding SNN
classifiers, the action advocated at a given timestep could be
recalculated based on sensory input, increasing
generalisation.

Results presented in this paper are intended to reinforce
the view that the use of self-adaptation and constructivism
is a means to achieving increased levels of parameter and
problem independence. XCS and XCSF are notoriously
heavily parameterised systems; self-adaptation removes the need
to set several parameters and constructivism allows the LCS
to adapt to the complexity of the problem it is presented
with. In the case of the transition to Webots, a much more
complex environmental representation was implemented
without significant parameter change. The transition to Webots
demonstrated the power of the system to create heterogeneous
continuous movement sequences from a single match set
formation, despite levels of sensory noise in simulation
being in excess of the emulated noise used in the continuous
maze environment.